
Fix background database account refresh stopping in multi-writer accounts#48758

Draft
jeet1995 wants to merge 3 commits into Azure:main from jeet1995:fix/background-refresh-multi-writer

Conversation

jeet1995 (Member) commented Apr 10, 2026

Problem

The GlobalEndpointManager background refresh timer silently stops in multi-writer accounts, preventing the SDK from detecting topology changes (e.g., multi-write to single-write transitions).

Root Cause

In refreshLocationPrivateAsync(), when LocationCache.shouldRefreshEndpoints() returns false, the timer is never restarted:

} else {
    logger.debug("shouldRefreshEndpoints: false, nothing to do.");
    this.isRefreshing.set(false);
    return Mono.empty(); // timer dies here
}

For multi-writer accounts, shouldRefreshEndpoints() returns false when the preferred write endpoint matches the current primary -- a steady-state condition. Once that happens, no further background refreshes occur for the lifetime of the client. The bug has existed since PR #6139 (Nov 2019; see point #4 in that PR's description).
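The failure mode is easy to reproduce in miniature. Below is a toy, deterministic model (a hypothetical ToyRefreshLoop class, not the SDK's actual types) of a self-rescheduling refresh loop: any branch that returns without re-arming the timer kills the loop permanently, so a later topology change is never observed.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of a self-rescheduling background refresh loop.
// Each fired timer tick must re-arm the next tick, or the loop dies.
class ToyRefreshLoop {
    final AtomicInteger refreshCount = new AtomicInteger();
    final AtomicBoolean needsRefresh = new AtomicBoolean(true);
    boolean timerArmed = true;           // is a next tick scheduled?
    final boolean rearmOnSteadyState;    // the behavior this PR adds

    ToyRefreshLoop(boolean rearmOnSteadyState) {
        this.rearmOnSteadyState = rearmOnSteadyState;
    }

    void tick() {
        if (!timerArmed) {
            return;                      // no timer pending: nothing ever runs again
        }
        timerArmed = false;              // the pending timer just fired
        if (needsRefresh.getAndSet(false)) {
            refreshCount.incrementAndGet();
            timerArmed = true;           // refresh path always reschedules
        } else if (rearmOnSteadyState) {
            timerArmed = true;           // the fix: keep the loop alive in steady state
        }                                // buggy path: falls through without rescheduling
    }

    void topologyChanged() {             // e.g. a multi-write -> single-write transition
        needsRefresh.set(true);
    }
}
```

Driving ticks by hand: with `rearmOnSteadyState = false` the loop performs one refresh, hits steady state, and never runs again even after `topologyChanged()`; with `true`, the change is picked up on the next tick.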

Behavioral Difference with .NET SDK

The .NET SDK handles this correctly in StartLocationBackgroundRefreshLoop() -- it only terminates when canRefreshInBackground is explicitly false, continuing even when ShouldRefreshEndpoints() returns false.

Fix

Add startRefreshLocationTimerAsync() to the else branch of refreshLocationPrivateAsync():

} else {
    logger.debug("shouldRefreshEndpoints: false, nothing to do.");
    if (!this.refreshInBackground.get()) {
        // re-arm the timer so the background refresh loop stays alive
        this.startRefreshLocationTimerAsync();
    }
    this.isRefreshing.set(false);
    return Mono.empty();
}

Unit Tests

6/6 pass:

  • backgroundRefreshForMultiMaster: Updated assertion -- timer must keep running
  • backgroundRefreshDetectsTopologyChangeForMultiMaster: New -- simulates MW-to-SW transition via mock

Live DR Drill Validation (4 Scenarios)

Date: 2026-04-10 22:10Z -- 2026-04-11 00:32Z | Branch: fix/background-refresh-multi-writer @ 2048abeca

All scenarios used Direct + Gateway modes simultaneously. Kusto data from BackendEndRequest5M (Direct) and Request5M (Gateway).

Accounts

| Account | Type | Regions |
| --- | --- | --- |
| bgrefresh-mw-test-440 | Multi-writer | East US (hub) + West US |
| bgrefresh-sw-test-440 | Single-writer | East US (write) + West US (read) |

Scenario 1: MW -- Offline Secondary Region

Global endpoint, preferred = West US. Offline West US, observe failover to East US.


PASS -- Failover to East US in ~4 min. 32 GEM refreshes. West US traffic resumed after restore.

Scenario 2: MW -- MW-to-SW-to-MW Transition (Core PR validation)

Regional endpoint (westus.documents.azure.com), no preferred region. Disable then re-enable multi-write.

PASS -- Both transitions detected. MW-to-SW in ~3.5 min (writes shifted to EUS). SW-to-MW in ~1 min (writes returned to WUS). 28 GEM refreshes.

Scenario 3: SW -- Switch Write Region

Global endpoint, preferred = East US. Switch write EUS-to-WUS.


PASS -- Writes on WUS within 1 Kusto bucket. 20 GEM refreshes.

Scenario 4: SW -- Offline Write Region

Global endpoint, preferred = East US. Offline East US.


PASS -- Full failover to WUS in ~3 min. 32 GEM refreshes.

Backend Success Rates

Direct mode (BackendEndRequest5M)

| Scenario | Workload | Total | Success | Rate |
| --- | --- | --- | --- | --- |
| S1 MW Offline | dr-off-direct-write | 262,576 | 262,548 | 99.989% |
| S1 MW Offline | dr-off-direct-read | 292,092 | 290,830 | 99.568% |
| S2 MW Trans. | dr-mwsw-direct-write | 175,272 | -- | -- |
| S2 MW Trans. | dr-mwsw-direct-read | 131,586 | -- | -- |
| S3 SW Switch | dr-direct-write | 142,567 | 142,499 | 99.952% |
| S3 SW Switch | dr-direct-read | 251,072 | 247,366 | 98.524% |
| S4 SW Offline | dr-off-direct-write | 197,669 | 197,633 | 99.982% |
| S4 SW Offline | dr-off-direct-read | 232,202 | 226,599 | 97.587% |

Gateway mode (Request5M)

| Scenario | Workload | Total | Success | Rate |
| --- | --- | --- | --- | --- |
| S1 MW Offline | dr-off-gw-write | 469,579 | 469,518 | 99.987% |
| S1 MW Offline | dr-off-gw-read | 557,311 | 557,307 | 99.999% |
| S2 MW Trans. | dr-mwsw-gw-write | 147,864 | 146,494 | 99.073% |
| S2 MW Trans. | dr-mwsw-gw-read | 196,383 | 196,383 | 100.0% |
| S3 SW Switch | dr-gw-write | 133,214 | 133,146 | 99.949% |
| S3 SW Switch | dr-gw-read | 231,657 | 231,657 | 100.0% |
| S4 SW Offline | dr-off-gw-write | (included in S1 totals) | -- | -- |
| S4 SW Offline | dr-off-gw-read | (included in S1 totals) | -- | -- |

All errors (403/3 write-to-read-only, 404/1002 session-not-available) were auto-retried by the SDK -- zero user-visible failures.

Verdict

| Scenario | Failover | GEM Refreshes | Direct Write % | GW Write % | Verdict |
| --- | --- | --- | --- | --- | --- |
| MW Offline Secondary | ~4 min to EUS | 32 | 99.989% | 99.987% | PASS |
| MW-to-SW-to-MW | ~3.5 min / ~1 min | 28 | -- | 99.073% | PASS |
| SW Switch Write | < 1 bucket | 20 | 99.952% | 99.949% | PASS |
| SW Offline Write | ~3 min to WUS | 32 | 99.982% | 99.987% | PASS |
Kusto Queries Used
// Direct mode ops (BackendEndRequest5M)
BackendEndRequest5M
| where TIMESTAMP between (datetime({start}) .. datetime({end}))
| where GlobalDatabaseAccountName == '{account}'
| where UserAgent has 'dr-'
| where ResourceType == 2
| extend Workload = extract('(dr-[a-z-]+-[a-z]+)', 1, UserAgent)
| summarize Requests=sum(SampleCount) by bin(TIMESTAMP, 5m), Region, Workload

// Gateway mode ops (Request5M -- lowercase columns)
Request5M
| where TIMESTAMP between (datetime({start}) .. datetime({end}))
| where globalDatabaseAccountName == '{account}'
| where userAgent has 'gw'
| extend Workload = extract('(dr-[a-z-]+-[a-z]+)', 1, userAgent)
| summarize Total=sum(SampleCount), Success=sumif(SampleCount, statusCode < 400) by Workload
| extend SuccessRate=round(100.0 * Success / Total, 3)

// Write region transitions (MgmtDatabaseAccountTrace)
MgmtDatabaseAccountTrace
| where TIMESTAMP between (datetime({start}) .. datetime({end}))
| where GlobalDatabaseAccount == '{account}'
| project TIMESTAMP, Location, LocationType, FederationId, Status
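For reference, the success-rate figures in the tables follow the query's rounding convention (round(100.0 * Success / Total, 3)). A minimal Java sketch of the same aggregation (hypothetical Sample record and SuccessRateCalc class; the numbers in the usage note are the S2 gateway-write row):

```java
import java.util.List;

// Mirrors the Kusto aggregation used above:
//   Total       = sum(SampleCount)
//   Success     = sumif(SampleCount, statusCode < 400)
//   SuccessRate = round(100.0 * Success / Total, 3)
class SuccessRateCalc {
    record Sample(String workload, int statusCode, long sampleCount) {}

    static double successRate(List<Sample> samples) {
        long total = samples.stream().mapToLong(Sample::sampleCount).sum();
        long success = samples.stream()
                .filter(s -> s.statusCode() < 400)
                .mapToLong(Sample::sampleCount)
                .sum();
        // Round to 3 decimal places, matching Kusto's round(x, 3)
        return Math.round(100.0 * success / total * 1000.0) / 1000.0;
    }
}
```

With the S2 gateway-write counts (146,494 successes out of 147,864 requests) this yields 99.073, matching the table.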

Changes

  • 1 file changed, 10 insertions (GlobalEndpointManager.java)
  • 1 file changed, 50 insertions, 1 deletion (GlobalEndPointManagerTest.java)

…unts

In multi-writer accounts, refreshLocationPrivateAsync() stops the background
refresh timer when shouldRefreshEndpoints() returns false. This means topology
changes (e.g., multi-write to single-write transitions) go undetected until
the next explicit refresh trigger.

The .NET SDK (azure-cosmos-dotnet-v3) correctly continues the background
refresh loop unconditionally - the loop only stops when canRefreshInBackground
is explicitly false, not when shouldRefreshEndpoints returns false.

This fix adds startRefreshLocationTimerAsync() to the else-branch of
refreshLocationPrivateAsync(), ensuring the background timer always reschedules
itself regardless of whether endpoints currently need refreshing.

Without this fix, after a multi-write -> single-write -> multi-write transition,
reads remain stuck on the primary region because the SDK never re-reads account
metadata to learn about the restored multi-write topology.

Unit tests updated:
- backgroundRefreshForMultiMaster: assertTrue (timer must keep running)
- backgroundRefreshDetectsTopologyChangeForMultiMaster: new test proving
  MW->SW transition detection via mock

Related: PR Azure#6139 (point #4 in description acknowledged this bug)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
jeet1995 force-pushed the fix/background-refresh-multi-writer branch from c95fb7b to 2048abe on April 10, 2026 at 20:51
jeet1995 and others added 2 commits April 10, 2026 20:57
…W switch, SW offline)

Kusto-backed evidence with charts for PR Azure#48758 validation.
Accounts: bgrefresh-mw-test-440 (multi-writer), bgrefresh-sw-test-440 (single-writer)
Branch: fix/background-refresh-multi-writer @ 2048abe

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tions, SW switch, SW offline)"

This reverts commit c9fc5c4.
@jeet1995
Member Author

/azp run java - cosmos - tests

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).
